Abstract

Markov decision processes (MDPs) are widely 
used for modeling decision-making problems in 
robotics, automated control, and economics. Tra- 
ditional MDPs assume that the decision maker 
(DM) knows all states and actions. However, 
this may not be true in many situations of interest.
We define a new framework, MDPs with unawareness
(MDPUs), to deal with the possibility that a DM may
not be aware of all possible
actions. We provide a complete characterization 
of when a DM can learn to play near-optimally 
in an MDPU, and give an algorithm that learns 
to play near-optimally when it is possible to do 
so, as efficiently as possible. In particular, we 
characterize when a near-optimal solution can be 
found in polynomial time. 



1 INTRODUCTION 

Markov decision processes (MDPs) [2, 12, 16] have been
used in a wide variety of settings to model decision mak- 
ing. The description of an MDP includes a set S of possible 
states and a set A of actions. Unfortunately, in many de- 
cision problems of interest, the decision maker (DM) does 
not know the state space, and is unaware of possible actions 
she can perform. For example, someone buying insurance 
may not be aware of all possible contingencies; someone 
playing a video game may not be aware of all the actions 
she is allowed to perform nor of all states in the game. 

The fact that the DM may not be aware of all states does not 
cause major problems. If an action leads to a new state and 
the set of possible actions is known, we can use standard 
techniques (discussed below) to decide what to do next. 
The more interesting issue comes in dealing with actions 
that the DM may not be aware of. If the DM is not aware 
of her lack of awareness, then it is clear how to proceed:
we can simply ignore these actions; they are not on the
DM's radar screen. We are interested in a situation where
the DM realizes that there are actions (and states) that she 
is not aware of, and thus will want to explore the MDP. We 
model this by using a special explore action. As a result of 
playing this action, the DM might become aware of more 
actions, whose effect she can then try to understand. 

We have been deliberately vague about what it means for a 
DM to be "unaware" of an action. We have in mind a setting
where there is a (possibly large) space A* of potential
actions. For example, in a video game, the space of poten- 
tial actions may consist of all possible inputs from all input 
devices combined (e.g., all combinations of mouse move- 
ments, presses of keys on the keyboard, and eye movements 
in front of the webcam); if a DM is trying to prove a theo- 
rem, at least in principle, all possible proof techniques can 
be described in English, so the space of potential actions 
can be viewed as a subset of the set of English texts. The 
space A of actual actions is the (typically small) subset of 
A* that are the "useful actions". For example, in a video 
game, these would be the combinations of arrow presses 
(and perhaps head movements) that have an appreciable ef- 
fect on the game. Of course, A* may not describe how the 
DM conceives of the potential acts. For example, a first- 
time video-game player may consider the action space to 
include only presses of the arrow keys, and be completely 
unaware that eye movement is an action. Similarly, a math- 
ematician trying to find a proof probably does not think 
of herself as searching in a space of English texts; she is 
more likely to be exploring the space of "proof techniques". 
A sophisticated mathematician or video game player will 
have a better understanding of the space that she views her- 
self as exploring. Moreover, the space of potential actions 
may change over time, as the DM becomes more sophisti- 
cated. Thus, we do not explicitly describe A* in our formal 
model, and abstract the process of exploration by just hav- 
ing an explore action. (It actually may make sense to have 
several different explore actions; we defer discussion of this 
point to Section 6.)

This type of exploration occurs all the time. In video 
games, first-time players often try to learn the game by
exploring the space of moves, without reading the instructions
(and thus, without being aware of all the moves they can 
make). Indeed, in many games, there may not be instruc- 
tions at all (even though players can often learn what moves 
are available by checking various sites on the web). Math- 
ematicians trying to generate new approaches to proving a 
theorem can be viewed as exploring the space of proof tech- 
niques. More practically, in robotics, if we take an action 
to be a "useful" sequence of basic moves, the space of potential
actions is often huge. For instance, most humanoid
robots (such as Honda's Asimo robot [18]) have more than
20 degrees of freedom; in such a large space, while robot
designers can hand-program a few basic actions (e.g.,
walking on a level surface), it is practically impossible to
do so for other general scenarios (e.g., walking on uneven 
rocks). Conceptually, it is useful to think of the designer as 
not being aware of the actions that can be performed. Ex- 
ploration is almost surely necessary to discover new actions 
necessary to enable the robot to perform the new tasks. 

Given the prevalence of MDPUs (MDPs with unawareness),
the problem of learning to play well in an MDPU
becomes of interest. There has already been a great deal of 
work on learning to play optimally in an MDP. Kearns and 
Singh [15] gave an algorithm called $E^3$ that converges to
near-optimal play in polynomial time. Brafman and Tennenholtz
[3] later gave an elegant algorithm they called
Rmax that converges to near-optimal play in polynomial
time not just in MDPs, but in a number of adversarial set- 
tings. Can we learn to play near-optimally in an MDPU? 
(By "near-optimal play", we mean near-optimal play in 
the actual MDP.) In the earlier work, near-optimal play in- 
volved learning the effects of actions (that is, the transition 
probabilities induced by the action). In our setting, the DM 
still has to learn the transition probabilities, but also has to 
learn what actions are available. 

Perhaps not surprisingly, we show that how effectively 
the DM can learn optimal play in an MDPU depends on 
the probability of discovering new actions. For example, 
if it is too low, then we can never learn to play near- 
optimally. If it is a little higher, then the DM can learn 
to play near-optimally, but it may take exponential time. 
If it is sufficiently high, then the DM can learn to play 
near-optimally in polynomial time. We give an expression 
whose value, under minimal assumptions, completely char- 
acterizes when the DM can learn to play optimally, and how 
long it will take. Moreover, we show that a modification of 
the Rmax algorithm (that we call URmax) can learn to 
play near-optimally if it is possible to do so. 

There is a subtlety here. Not only might the DM not be 
aware of what actions can be performed in a given state, she 
may be unaware of how many actions can be performed. 
Thus, for example, in a state where she has discovered five 
actions, she may not know whether she has discovered all 
the actions (in which case she should not explore further)
or there are more actions to be found (in which case she
should). Even in cases where the DM knows that there is only
one action to be discovered, and what its payoff is, it is
still possible that the DM never learns to play optimally.
Our impossibility results and lower bound hold even in this
case. (For example, if the action to be discovered is a proof
that P ≠ NP, the DM may know that the action has a high
payoff; she just does not know what that action is.) On the
other hand, URmax works even if the DM does not know 
how many actions there are to be discovered. 

There has been a great deal of recent work on awareness in 
the game theory literature (see, for example, [5, 6, 9, 11]).
There has also been work on MDPs with a large action
space (see, for example, [10]), and on finding new actions
once exploration is initiated [4]. None of these papers,
however, considers the problem of learning in the presence 
of lack of awareness; we believe that we are the first to do 
so. 

The rest of the paper is organized as follows. In Section 2,
we review the work on learning to play optimally
in MDPs. In Section 3, we describe our model of MDPUs.
We give our impossibility results and lower bounds
in Section 4. In Section 5, we present a general learning
algorithm (adapted from R-MAX) for MDPU problems, and
give upper bounds. We conclude in Section 6. Missing
proofs can be found in the full paper.

2 PRELIMINARIES 

In this section, we review the work on learning to play optimally
in MDPs and, specifically, Brafman and Tennenholtz's
R-MAX algorithm [3].

MDPs: An MDP is a tuple $M = (S, A, P, R)$, where
S is a finite set of states; A is a finite set of actions;
$P : (S \times S \times A) \to [0, 1]$ is the transition probability
function, where $P(s, s', a)$ gives the transition probability
from state s to state s' with action a; and
$R : (S \times S \times A) \to \mathbb{R}^+$ is the reward function,
where $R(s, s', a)$ gives the reward for playing action a at state s
and transiting to state s'. Since P is a probability function, we
have $\sum_{s' \in S} P(s, s', a) = 1$ for all $s \in S$ and $a \in A$.
A policy in an MDP $(S, A, P, R)$ is a function from histories
to actions in A. Given an MDP $M = (S, A, P, R)$,
let $U_M(s, \pi, T)$ denote the expected T-step undiscounted
average reward of policy $\pi$ started in state s, that is, the
expected total reward of running $\pi$ for T steps, divided
by T. Let $U_M(s, \pi) = \lim_{T \to \infty} U_M(s, \pi, T)$, and let
$U_M(\pi) = \min_{s \in S} U_M(s, \pi)$.
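As an illustration of the definitions above, the following is a minimal Python sketch that stores P and R as dictionaries and computes the T-step undiscounted average reward exactly by backward induction. It assumes a stationary policy (a map from states to actions), which is only a special case of the history-dependent policies allowed in the text; the names `average_reward`, `P_demo`, `R_demo`, and `policy_demo`, and the two-state example itself, are hypothetical.

```python
from typing import Dict, Tuple

State = str
Action = str
# P[(s, a)] maps each next state s' to its probability.
Transition = Dict[Tuple[State, Action], Dict[State, float]]
# R[(s, s_next, a)] is the reward for playing a in s and moving to s_next.
Reward = Dict[Tuple[State, State, Action], float]


def check_transitions(P: Transition, tol: float = 1e-9) -> bool:
    """Return True if, for every (s, a), the probabilities over s' sum to 1."""
    return all(abs(sum(dist.values()) - 1.0) <= tol for dist in P.values())


def average_reward(P: Transition, R: Reward, policy: Dict[State, Action],
                   s: State, T: int) -> float:
    """Expected total reward of a stationary policy over T steps from s, divided by T."""
    states = {s0 for (s0, _a) in P}
    value = {x: 0.0 for x in states}  # expected total reward with 0 steps to go
    for _ in range(T):
        value = {
            x: sum(p * (R[(x, nxt, policy[x])] + value[nxt])
                   for nxt, p in P[(x, policy[x])].items())
            for x in states
        }
    return value[s] / T


# Hypothetical example: "go" moves s0 -> s1 (reward 0) and keeps s1 -> s1 (reward 1).
P_demo: Transition = {("s0", "go"): {"s1": 1.0}, ("s1", "go"): {"s1": 1.0}}
R_demo: Reward = {("s0", "s1", "go"): 0.0, ("s1", "s1", "go"): 1.0}
policy_demo = {"s0": "go", "s1": "go"}
```

In this example the average reward from s0 over T steps is (T - 1)/T, which approaches the limit value 1 only as T grows; this is the kind of delay that the mixing time is meant to capture.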

The mixing time: For a policy $\pi$ such that $U_M(\pi) = \alpha$,
it may take a long time for $\pi$ to get an expected payoff of $\alpha$.
For example, if getting a high reward involves reaching a
particular state s*, and the probability of reaching s* from
some state s is low, then the time to get the high reward
will be high. To deal with this, Kearns and Singh [15] argue
that the running time of a learning algorithm should be
compared to the time that an algorithm with full information
takes to get a comparable reward. Define the $\epsilon$-return
mixing time of policy $\pi$ to be the smallest value of T such
that $\pi$ guarantees an expected payoff of at least $U(\pi) - \epsilon$;
that is, it is the least T such that $U(s, \pi, t) \ge U(\pi) - \epsilon$
for all states s and times $t \ge T$. Let $\Pi(\epsilon, T)$ consist
of all policies whose $\epsilon$-mixing time is at most T. Let
$Opt(M, \epsilon, T) = \max_{\pi \in \Pi(\epsilon, T)} U_M(\pi)$.

Rmax: Rmax is a model-based near-optimal 
polynomial-time reinforcement learning algorithm for 
zero-sum stochastic games (SG), which also directly ap- 
plies to standard MDPs. Rmax assumes that the DM 
knows all the actions that can be played in the game, but 
needs to learn the transition probabilities and reward func- 
tion associated with each action. It does not assume that the 
DM knows all states; new states might be discovered when 
playing actions at known states. Rmax follows an implicit 
"explore or exploit" mechanism that is biased towards ex- 
ploration. Here is the Rmax algorithm: 

Rmax($|S|$, $|A|$, $R_{\max}$, $T$, $\epsilon$, $\delta$, $s_0$):

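Set $K_1(T) := \max\left(\left(\left\lceil \frac{4|S|TR_{\max}}{\epsilon} \right\rceil\right)^3, \left\lceil -6\ln^3\left(\frac{\delta}{6|S||A|^2}\right) \right\rceil\right) + 1$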
Set $M' := M^0$ (the initial approximation described below)
Compute an optimal policy $\pi'$ for $M'$
Repeat until all action/state pairs (s, a) are known
    Play $\pi'$ starting in state $s_0$ for T steps or until some new
        state-action pair (s, a) is known
    if (s, a) has just become known then update $M'$ so that
        the transition probabilities for (s, a) are the observed
        frequencies and the rewards for playing (s, a) are those
        that have been observed.
    Compute the optimal policy $\pi'$ for $M'$
Return $\pi'$.

Here $R_{\max}$ is the maximum possible reward; $\epsilon > 0$;
$0 < \delta < 1$; T is the $\epsilon$-return mixing time; $K_1(T)$ represents
the number of visits required to approximate a transition
function; a state-action pair (s, a) is said to be known
only if it has been played $K_1(T)$ times. Rmax proceeds
in iterations, and $M'$ is the current "approximation" to the
true MDP. $M'$ consists of the state set S and a dummy state $s_d$.
The transition and reward functions in $M'$ may be different
from those of the actual MDP. In the initial approximation
$M^0$, the transition and reward functions are trivial: when
an action a is taken in any state s (including the dummy
state $s_d$), with probability 1 there is a transition to $s_d$, with
reward $R_{\max}$.
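The bookkeeping described above can be sketched in Python as follows. This is a minimal illustration only: it covers the optimistic initial model and the step at which a pair (s, a) becomes known, and leaves out the planning step that computes the optimal policy for the current model. The names `initial_model`, `VisitCounter`, and `update_model` are hypothetical, and `k1` stands in for whatever value of $K_1(T)$ is used.

```python
from collections import defaultdict

DUMMY = "s_d"  # the dummy state of the approximate model


def initial_model(states, actions, r_max):
    """Initial approximation: every (s, a), including s = s_d, leads to s_d
    with probability 1 and reward r_max."""
    all_states = list(states) + [DUMMY]
    P = {(s, a): {DUMMY: 1.0} for s in all_states for a in actions}
    R = {(s, a): {DUMMY: r_max} for s in all_states for a in actions}
    return P, R


class VisitCounter:
    """Counts plays of each (s, a) and remembers the observed outcomes."""

    def __init__(self, k1):
        self.k1 = k1  # number of plays after which (s, a) counts as known
        self.counts = defaultdict(int)
        self.outcomes = defaultdict(lambda: defaultdict(int))
        self.rewards = {}

    def record(self, s, a, s_next, reward):
        """Record one play; return True exactly when (s, a) has just become known."""
        if self.counts[(s, a)] >= self.k1:
            return False  # already known, nothing to update
        self.counts[(s, a)] += 1
        self.outcomes[(s, a)][s_next] += 1
        self.rewards[(s, a, s_next)] = reward
        return self.counts[(s, a)] == self.k1

    def empirical(self, s, a):
        """Observed transition frequencies for (s, a)."""
        n = self.counts[(s, a)]
        return {s2: c / n for s2, c in self.outcomes[(s, a)].items()}


def update_model(P, R, counter, s, a):
    """Replace the optimistic entries for (s, a) by the observed ones."""
    P[(s, a)] = counter.empirical(s, a)
    R[(s, a)] = {s2: counter.rewards[(s, a, s2)] for s2 in P[(s, a)]}
```

Because every unknown pair promises the maximum reward forever, an optimal policy for the approximate model is drawn towards unknown pairs, which is the sense in which the mechanism is biased towards exploration.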

Brafman and Tennenholtz [3] show that
Rmax($|S|$, $|A|$, $R_{\max}$, $T$, $\epsilon$, $\delta$, $s_0$) learns a policy with
expected payoff within $\epsilon$ of $Opt(M, \epsilon, T)$ with probability
greater than $1 - \delta$, no matter what state $s_0$ it starts in, in
time polynomial in $|S|$, $|A|$, T, $1/\delta$, and $1/\epsilon$. What makes
Rmax work is that in each iteration, it either achieves
a near-optimal reward with respect to the real model or
learns an unknown transition with high probability. Since
there are only polynomially many (s, a) pairs (in the number
of states and actions) to learn, and each transition entry
requires $K_1(T)$ samples, where $K_1(T)$ is polynomial in
the number of states and actions, $1/\epsilon$, $1/\delta$, and the $\epsilon$-return
mixing time T, Rmax clearly runs in time polynomial in
these parameters. In the case that the $\epsilon$-return mixing time
T is not known, Rmax starts with T = 1, then considers
T = 2, T = 3, and so on. We expand on this point further
in Section 5, for in an MDPU we need to deal with the fact
that the number of states and actions is unknown.

3 MDPS WITH UNAWARENESS 

Intuitively, an MDPU is like a standard MDP except that 
the player is initially aware of only a subset of the complete 
set of states and actions. To reflect the fact that new states 
and actions may be learned during the game, the model pro- 
vides a special explore action. By playing this action, the 
DM may become aware of actions that she was previously 
unaware of. The model includes a discovery probability 
function characterizing the likelihood that a new action will 
be discovered. At any moment in the game, the DM can perform
only actions that she is currently aware of.

Definition 3.1: An MDPU is a tuple
$M = (S, A, S_0, a_0, g_A, g_0, P, D, R, R^+, R^-)$, where

• S is the set of states in the underlying MDP;

• A is the set of actions in the underlying MDP;

• $S_0 \subseteq S$ is the set of states that the DM is initially
aware of;

• $a_0 \notin A$ is the explore action;

• $g_A : S \to 2^A$, where $g_A(s)$ is the set of actions that
can be performed at s other than $a_0$ (we assume that
$a_0$ can be performed in every state);

• $g_0 : S_0 \to 2^A$, where $g_0(s) \subseteq g_A(s)$ is the set of
actions that the DM is aware of at state s (we assume
that the DM is always aware of $a_0$);

• $P : \bigcup_{s \in S} (\{s\} \times S \times g_A(s)) \to [0, 1]$ is the transition
probability function (as usual, we require that
$\sum_{s' \in S} P(s, s', a) = 1$ if $a \in g_A(s)$);

• $D : \mathbb{N} \times \mathbb{N} \times S \to [0, 1]$ is the discovery probability
function. $D(j, t, s)$ gives the probability of discovering
a new action in state $s \in S$ given that there are
j actions to be discovered and $a_0$ has already been
played $t - 1$ times in s without a new action being
discovered (see below for further discussion);

• $R : \bigcup_{s \in S} (\{s\} \times S \times g_A(s)) \to \mathbb{R}^+$ is the reward
function (see footnote 1);

• $R^+ : S \to \mathbb{R}^+$ and $R^- : S \to \mathbb{R}^+$ give the exploration
reward for playing $a_0$ at state $s \in S$ and
discovering (resp., not discovering) a new action (see
below for further discussion).

Let $M^u = (S, A, g_A, P, R)$ be the MDP underlying the
MDPU M.
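The components of the definition can be collected in a small container, as in the following Python sketch. It is only an illustration under stated assumptions: the class `MDPU`, its field names, the label `EXPLORE`, and the method `problems` are hypothetical, and the sample instance uses the discovery probability $1/(t+1)^2$ with placeholder rewards 1 and 2.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, FrozenSet, List, Tuple

EXPLORE = "a0"  # hypothetical label for the explore action a_0


@dataclass(frozen=True)
class MDPU:
    states: FrozenSet[str]                               # S
    actions: FrozenSet[str]                              # A
    initial_states: FrozenSet[str]                       # S_0
    available: Dict[str, FrozenSet[str]]                 # g_A
    aware: Dict[str, FrozenSet[str]]                     # g_0
    transition: Dict[Tuple[str, str], Dict[str, float]]  # P: (s, a) -> {s': prob}
    discovery: Callable[[int, int, str], float]          # D(j, t, s)
    reward: Dict[Tuple[str, str, str], float]            # R: (s, s', a) -> reward
    reward_found: Dict[str, float]                       # R^+
    reward_not_found: Dict[str, float]                   # R^-

    def problems(self) -> List[str]:
        """List the constraints of the definition that this instance violates."""
        issues = []
        if EXPLORE in self.actions:
            issues.append("explore action must not belong to A")
        if not self.initial_states <= self.states:
            issues.append("S_0 must be a subset of S")
        for s in self.states:
            if not self.available.get(s, frozenset()) <= self.actions:
                issues.append(f"g_A({s}) must be a subset of A")
        for s in self.initial_states:
            if not self.aware.get(s, frozenset()) <= self.available.get(s, frozenset()):
                issues.append(f"g_0({s}) must be a subset of g_A({s})")
        for (s, a), dist in self.transition.items():
            if a not in self.available.get(s, frozenset()):
                issues.append(f"P is defined for ({s}, {a}) but {a} is not in g_A({s})")
            if abs(sum(dist.values()) - 1.0) > 1e-9:
                issues.append(f"P({s}, ., {a}) must sum to 1")
        return issues


# Hypothetical one-state instance: the DM knows a1 but not a2.
example = MDPU(
    states=frozenset({"s1"}),
    actions=frozenset({"a1", "a2"}),
    initial_states=frozenset({"s1"}),
    available={"s1": frozenset({"a1", "a2"})},
    aware={"s1": frozenset({"a1"})},
    transition={("s1", "a1"): {"s1": 1.0}, ("s1", "a2"): {"s1": 1.0}},
    discovery=lambda j, t, s: 1.0 / (t + 1) ** 2,
    reward={("s1", "s1", "a1"): 1.0, ("s1", "s1", "a2"): 2.0},
    reward_found={"s1": 0.0},
    reward_not_found={"s1": 0.0},
)

# An instance in which the DM is "aware" of an action that does not exist.
broken = replace(example, aware={"s1": frozenset({"a3"})})
```

Returning a list of violations, rather than raising on the first one, is a design choice made here so that all inconsistencies in a hand-written instance can be seen at once.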

Given $S_0$ and $g_0$, we abuse notation and take
$A_0 = \bigcup_{s \in S_0} g_0(s)$; that is, $A_0$ is the set of actions that the DM
is aware of.

Just like a standard MDP, an MDPU has a state space S,
action space A, transition probability function P, and reward
function R (see footnote 2). Note that we do not give the transition
function for the explore action $a_0$ above; since we assume
that $a_0$ does not result in a state change (although new actions
might be discovered when $a_0$ is played), for each state
$s \in S$, we have $P(s, s, a_0) = 1$. The new features here involve
dealing with $a_0$. We need to quantify how hard it is
to discover a new action. Intuitively, this should in general 
depend on how many actions there are to be discovered, 
and how long the DM has been trying to find a new action. 
For example, if the DM has in fact found all the actions, 
then this probability is clearly 0. Since the DM is not as- 
sumed to know in general how many actions there are to be 
found, all we can do is give what we view as the DM's sub- 
jective probability of finding a new action, given that there 
are j actions to be found. Note that even if the DM does not 
know the number of actions, she can still condition on there 
being j actions. In general, we also expect this probability 
to depend on how long the DM has been trying to find a 
new action. This probability is captured by D(j, t, s). 

1 We assume without loss of generality that all payoffs are non-negative.
If not, we can shift all rewards by a positive value so that
all payoffs become non-negative.

2 It is often assumed that the same actions can be performed
in all states. Here we allow slightly more generality by assuming
that the actions that can be performed are state-dependent, where
the dependence is given by $g_A$.

We assume that $D(j, t, s)$ is nondecreasing as a function of
j: with more actions available, it is easier to find a new one.
How $D(j, t, s)$ varies with t depends on the problem. For
example, if the DM is searching for the on/off button on
her new iPhone, which is guaranteed to be found in a limited
surface area, then $D(j, t, s)$ should increase as a function
of t. The more possibilities have been eliminated, the
more likely it is that the DM will find the button when the
next possibility is tested. On the other hand, if the DM is
searching for a proof, then the longer she searches without
finding one, the more discouraged she will get; she will believe
that it is more likely that no proof exists. In this case,
we would expect $D(j, t, s)$ to decrease as a function of t.
Finally, if we think of the explore action as doing a random
test in some space of potential actions, the probability of
finding a new action is a constant, independent of t. In the
sequel, we assume for ease of exposition that $D(j, t, s)$ is
independent of s, so we write $D(j, t)$ rather than $D(j, t, s)$.
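The three qualitative behaviours of D(j, t) described above can be made concrete with the following Python sketch. The specific functional forms are assumptions chosen only for illustration (the text fixes the shapes, not the formulas), and the names `d_button_search`, `d_proof_search`, and `d_random_test`, as well as the parameters `n_spots` and `p`, are hypothetical.

```python
def d_button_search(j: int, t: int, n_spots: int = 10) -> float:
    """Increasing in t: j targets hidden among n_spots locations that are
    checked one by one, so each failure shrinks the remaining search space."""
    if j <= 0:
        return 0.0
    remaining = n_spots - (t - 1)  # locations not yet eliminated
    if remaining <= j:
        return 1.0
    return j / remaining


def d_proof_search(j: int, t: int) -> float:
    """Decreasing in t: each failure lowers the chance of success next time."""
    if j <= 0:
        return 0.0
    return 1.0 / (t + 1) ** 2


def d_random_test(j: int, t: int, p: float = 0.05) -> float:
    """Constant in t: each exploration is an independent random test that
    hits any given undiscovered action with probability p."""
    return 1.0 - (1.0 - p) ** j
```

All three return 0 when j = 0, matching the remark that the probability is 0 once every action has been found, and all three are nondecreasing in j.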

$R^+$ and $R^-$ are the analogues of the reward function R for
the explore action $a_0$. Although performing action $a_0$ does
not change the state, we can think of there being two copies
of each state s, call them $s^+$ and $s^-$, which are just like s
except that in $s^+$ the DM has discovered a new action after
exploration, and in $s^-$ she hasn't. Then $R^+(s)$ and $R^-(s)$
can be thought of as $R(s, s^+, a_0)$ and $R(s, s^-, a_0)$, respectively.
In a chess game, the explore action corresponds
to thinking. There is clearly a negative reward to thinking
and not discovering a new action (valuable time is lost);
we capture this by $R^-(s)$. On the other hand, a player often
gets a thrill if a useful action is discovered; this
is captured by $R^+(s)$. It seems reasonable to require that
$R^-(s) \le R^+(s)$, which we do from here on. Since the
whole point of exploration is to discover a new action, the
reward for discovering one should be no smaller than the
reward for not discovering one.

When an MDPU starts, $S_0$ represents the set of states that
the DM is initially aware of, and $g_0(s)$ represents the set of
actions that she is aware of at state s. The DM may discover
new states when trying out known actions; she may also
discover new actions as the explore action $a_0$ is played. At
any time, the DM has a current set of states and actions that
she is aware of; she can play only actions from the set that
she is currently aware of.
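A single play of the explore action in a fixed state can be simulated as in the Python sketch below. Two points are assumptions made for the sketch and are not fixed by the text: which undiscovered action is found (here, one chosen uniformly at random), and that the count of unsuccessful explorations restarts after a discovery. The function name `explore` and its parameters are hypothetical.

```python
import random


def explore(aware, available, failures, D, r_plus, r_minus, rng):
    """Simulate one play of the explore action in a fixed state.

    aware:     set of actions the DM currently knows here (updated in place)
    available: set of all actions that can be performed here
    failures:  number of explorations so far without a discovery
    D:         discovery probability, called as D(j, t)
    Returns (reward, discovered action or None, updated failure count).
    """
    undiscovered = sorted(available - aware)
    j = len(undiscovered)  # actions still to be discovered
    t = failures + 1       # this is the t-th attempt since the last discovery
    if j > 0 and rng.random() < D(j, t):
        found = rng.choice(undiscovered)  # assumption: uniform over undiscovered actions
        aware.add(found)
        return r_plus, found, 0           # assumption: the counter restarts
    return r_minus, None, failures + 1
```

The state itself never changes in this step; only the DM's awareness set and the reward received depend on whether a discovery occurred.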

In stating our results, we need to be clear about what the inputs
to an algorithm for near-optimal play are. We assume
that $S_0$, $g_0$, D, $R^+$, and $R^-$ are always part of the input to
the algorithm. The reward function R is not given, but is
part of what is learned. (We could equally well assume that
R is given for the actions and states that the DM is aware
of; this assumption would have no impact on our results.)
Brafman and Tennenholtz [3] assume that the DM is given
a bound on the maximum reward, but later show that this
information is not needed to learn to play near-optimally in
their setting. Our algorithm URmax does not need to be
given a bound on the reward either. Perhaps the most interesting
question is what the DM knows about A and S. Our
lower bounds and impossibility result hold even if the DM
knows $|S|$ and $|g_A(s)|$ for all $s \in S$. On the other hand,
URmax requires neither $|S|$ nor $|g_A(s)|$ for $s \in S$. That
is, when something cannot be done, knowing the size of the
set of states and actions does not help; but when something
can be done, it can be done without knowing the size of the
set of states and actions.

Formally, we can view the DM's knowledge as the input to
the learning algorithm. An MDP M is compatible with the
DM's knowledge if all the parameters of M agree with
the corresponding parameters that the DM knows about. If
the DM knows only $S_0$, $g_0$, D, $R^+$, and $R^-$ (we assume
that the DM always knows at least this), then every MDP
$(S', A', g', P', R')$ where $S_0 \subseteq S'$ and $g_0(s) \subseteq A'(s)$
is compatible with the DM's knowledge. If the DM also
knows $|S|$, then we must have $|S'| = |S|$; if the DM
knows that $S = S_0$, then we must have $S' = S_0$. We
use $R_{\max}$ to denote the maximum possible reward. Thus,
if the DM knows $R_{\max}$, then in a compatible MDP, we have
$R(s, s', a') \le R_{\max}$, with equality holding for some transition.
(The DM may just know a bound on $R_{\max}$, or not
know $R_{\max}$ at all.) If the DM knows $R_{\max}$, we assume that
$R^+(s) < R_{\max}$ for all $s \in S$ (for otherwise, the optimal
policy for the MDPU becomes trivial: the DM should just
get to state s and keep exploring). Brafman and Tennenholtz
essentially assume that the DM knows $|A|$, $|S|$, and
$R_{\max}$. They say that they believe that the assumption that
the DM knows $R_{\max}$ can be removed. It follows from our
results that, in fact, the DM does not need to know any of
$|A|$, $|S|$, or $R_{\max}$.

Our theorems talk about whether there is an algorithm for a
DM to learn to play near-optimally given some knowledge.
We define "near-optimal play" by extending the definitions
of Brafman and Tennenholtz [3] and Kearns and Singh [15]
to deal with unawareness. In an MDPU, a policy is again a
function from histories to actions, but now the action must
be one that the DM is aware of at the last state in the history.
The DM can learn to play near-optimally given a state
space $S_0$ and some other knowledge if, for all $\epsilon > 0$, $\delta > 0$,
T, and $s \in S_0$, the DM can learn a policy $\pi_{\epsilon,\delta,T,s}$ such
that, for all MDPs M compatible with the DM's knowledge,
there exists a time $t_{M,\epsilon,\delta,T}$ such that, with probability
at least $1 - \delta$, $U_M(s, \pi_{\epsilon,\delta,T,s}, t) \ge Opt(M, \epsilon, T) - \epsilon$
for all $t \ge t_{M,\epsilon,\delta,T}$ (see footnote 3). The DM can learn to play
near-optimally given some knowledge in polynomial (resp.,
exponential) time if there exists a polynomial (resp., exponential)
function f of five arguments such that we can take
$t_{M,\epsilon,\delta,T} = f(T, |S|, |A|, 1/\epsilon, 1/\delta)$.

4 IMPOSSIBILITY RESULTS AND LOWER BOUNDS

The ability to estimate in which cases the DM can learn to 
play optimally is crucial in many situations. For example, 
in robotics, if the probability of discovering new actions is 
so low that it would require exponential time to
learn to play near-optimally, then the designer of the robot 
must have human engineers design the actions and not rely 
on automatic discovery. We begin by trying to understand 
when it is feasible to learn to play optimally, and then con- 
sider how to do so. 

3 Note that we allow the policy to depend on the state. However,
it must have an expected payoff that is close to that obtained
by M no matter what state M is started in.

We first show that, for some problems, there are no algorithms
that can guarantee near-optimal play; in other cases,
there are algorithms that will learn to play near-optimally, 
but will require at least exponential time to do so. These 
results hold even for problems where the DM knows that 
there are two actions, already knows one of them, and 
knows the reward of the other. 

In the following examples and theorem, we use $E_{t,s}$ to denote
the event of playing $a_0$ t times at state s without discovering
a new action, conditional on there being at least
one undiscovered action.

Example 4.1: Suppose that the DM knows that $S = S_0 = \{s_1\}$,
$g_0(s_1) = \{a_1\}$, $|A| = 2$, $P(s_1, s_1, a) = 1$ for all
actions $a \in A$, $R(s_1, s_1, a_1) = r_1$, $R^+(s_1) = R^-(s_1) = 0$,
$D(j, t) = \frac{1}{(t+1)^2}$, and the reward for the optimal policy in
the true MDP is $r_2$, where $r_2 > r_1$. Since the DM knows
that there is only one state and two actions, the DM knows
that in the true MDP, there is an action $a_2$ that she is not
aware of such that $R(s_1, s_1, a_2) = r_2$. That is, she knows
everything about the true MDP but the action $a_2$. We now
show that, given this knowledge, the DM cannot learn to
play optimally.

Clearly in the true MDP the optimal policy is to always play
$a_2$. However, to play $a_2$, the DM must learn about $a_2$. As
we now show, no algorithm can learn about $a_2$ with probability
greater than 1/2, and thus no algorithm can attain an
expected return $\ge (r_1 + r_2)/2 = r_2 - (r_2 - r_1)/2$.

Since there is exactly one unknown action, and the DM
knows this, we have

$$\Pr(E_{t,s_1}) = \prod_{t'=1}^{t} (1 - D(1, t')) = \prod_{t'=1}^{t} \left(1 - \frac{1}{(t'+1)^2}\right) = \frac{t+2}{2(t+1)} > \frac{1}{2}.$$

For the third equality, note that
$1 - \frac{1}{(t'+1)^2} = \left(1 - \frac{1}{t'+1}\right) \times \left(1 + \frac{1}{t'+1}\right)$;
it follows that

$$\prod_{t'=1}^{t} \left(1 - \frac{1}{(t'+1)^2}\right) = \left(\frac{1}{2} \times \frac{3}{2}\right) \times \left(\frac{2}{3} \times \frac{4}{3}\right) \times \cdots \times \left(\frac{t}{t+1} \times \frac{t+2}{t+1}\right).$$

All terms but the first and last cancel out. Thus, the product
is $\frac{t+2}{2(t+1)}$, as desired. The inequality above shows that
$\Pr(E_{t,s_1})$ is always strictly greater than 1/2, independent
of t. In other words, the DM cannot discover the better
action $a_2$ with probability greater than 1/2 no matter how
many times $a_0$ is played. It easily follows that the expected
reward of any policy is at most $(r_1 + r_2)/2$. Thus, there is
no algorithm that learns to play near-optimally. ∎
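The telescoping product in this example can be checked with exact rational arithmetic. The Python sketch below is only a sanity check of the closed form, assuming the discovery probability $1/(t+1)^2$ used in the example; the function name `prob_no_discovery` is hypothetical.

```python
from fractions import Fraction


def prob_no_discovery(t: int) -> Fraction:
    """Probability that t consecutive explorations all fail to find the
    single missing action when D(1, t') = 1 / (t' + 1)^2."""
    prob = Fraction(1)
    for k in range(1, t + 1):
        prob *= 1 - Fraction(1, (k + 1) ** 2)
    return prob


# Compare the product with the closed form (t + 2) / (2 (t + 1)).
for t in range(1, 200):
    assert prob_no_discovery(t) == Fraction(t + 2, 2 * (t + 1))
    assert prob_no_discovery(t) > Fraction(1, 2)
```

Using `Fraction` rather than floating point means the equality with the closed form is tested exactly, not up to rounding.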



The problem in Example 4.1 is that the discovery probability
is so low that there is a probability bounded away
from 0 that some action will not be discovered, no matter
how many times $a_0$ is played. The following theorem
generalizes Example 4.1, giving a sufficient condition on
the failure probability (which we later show is also necessary)
that captures the precise sense in which the discovery
probability is too low. Intuitively, the theorem says that if
the DM is unaware of some acts that can improve her expected
reward, and the discovery probability is sufficiently
low, where "sufficiently low" means $D(1, t) < 1$ for all t
and $\sum_{t=1}^{\infty} D(1, t) < \infty$, then the DM cannot learn to play
near-optimally. To make the theorem as strong as possible,
we show that the lower bound holds even if the DM
has quite a bit of extra information, as characterized in the
following definition.

Definition 4.2: Define a DM to be quite knowledgeable if
(in addition to $S_0$, $g_0$, D, $R^+$, and $R^-$) she knows $S = S_0$,
$|A|$, the transition function $P_0$, the reward function $R_0$ for
states in $S_0$ and actions in $A_0$, and $R_{\max}$.

We can now state our theorem. It turns out that there are
slightly different conditions on the lower bound depending
on whether $|S_0| \ge 2$ or $|S_0| = 1$.

Theorem 4.3: If $D(1, t) < 1$ for all t and $\sum_{t=1}^{\infty} D(1, t) < \infty$,
then there exists a constant c such that no algorithm
can obtain within c of the optimal reward for all MDPs that
are compatible with what the DM knows, even if the DM is
quite knowledgeable, provided that $|S_0| \ge 2$, $|A| > |A_0|$,
and $R_{\max}$ is greater than the reward of the optimal policy
in the MDP $(S_0, A_0, P_0, R_0)$. If $|S_0| = 1$, the same result
holds if $\sum_{t=1}^{\infty} D(j, t) < \infty$, where $j = |A| - |A_0|$.

Proof: We construct an MDP $M'' = (S, A'', g'', P'', R'')$
that is compatible with what the DM knows, such that no
algorithm can obtain within a constant c of the optimal reward
in $M''$. The construction is similar in spirit to that
of Example 4.1. Since $|S| \ge 2$, let $s_1$ be a state in S.
Let $j = |A| - |A_0|$, let $A'' = A_0 \cup \{a_1, \ldots, a_j\}$, where
$a_1, \ldots, a_j$ are fresh actions not in $A_0$, and let $g''$ be such that
$g''(s_1) = g_0(s_1) \cup \{a_1\}$ and $g''(s) = A''$ for $s \ne s_1$. That
is, there is only one action that the DM is not aware of in
state $s_1$, while in all other states, she is unaware of all actions
in $A - A_0$. Let $P''(s_1, s_1, a_1) = P''(s, s_1, a) = 1$
for all $a \in A'' - A_0$ and $s \in S$ (note that $P''$ is determined
by $P_0$ in all other cases). It is easy to check that
$M''$ is compatible with what the DM knows, even if the
DM knows that $S = S_0$, knows $|A|$, and knows $R_{\max}$.
Let $R''(s_1, s_1, a_1) = R''(s, s_1, a) = R_{\max}$ for all $s \ne s_1$
and $a \in A - A_0$ ($R''$ is determined by $R_0$ in all other
cases). By assumption, the reward of the optimal policy in
$(S_0, A_0, g_0, P_0, R_0)$ is less than $R_{\max}$, so the optimal policy
is clearly to get to state $s_1$ and then to play $a_1$ (giving
an average reward of $R_{\max}$ per time unit). Of course, doing
this requires learning $a_1$.



As in Example 4.1, we first prove that for $M''$ there exists a
constant $d > 0$ such that, with probability d, no algorithm
will discover action $a_1$ in state $s_1$.

Again, we have

$$\Pr(E_{t,s_1}) = \prod_{t'=1}^{t} (1 - D(1, t')).$$

Since $\sum_{t=1}^{\infty} D(1, t) < \infty$, we must have that
$\lim_{t \to \infty} D(1, t) = 0$. Since $D(1, t) < 1$ for all t, there
must exist a constant $c_1 < 1$ such that $D(1, t) \le c_1$ for all
t. If $c_1 = 0$, then $D(1, t) = 0$ for all $t \ge 1$ (since $D(1, t) \le c_1$
by assumption, and $D(1, t) \ge 0$, since it is a probability),
so $\Pr(E_{t,s_1}) = 1$, and we can take $d = 1$. If $c_1 > 0$,
then we show below that $1 - D(1, t') \ge (1 - c_1)^{D(1,t')/c_1}$.
Thus, we get that

$$\Pr(E_{t,s_1}) \ge \prod_{t'=1}^{t} (1 - c_1)^{D(1,t')/c_1} \ge (1 - c_1)^{\sum_{t'=1}^{\infty} D(1,t')/c_1}.$$

Since, by assumption, $\sum_{t=1}^{\infty} D(1, t) < \infty$, we can take
$d = (1 - c_1)^{\sum_{t=1}^{\infty} D(1,t)/c_1}$.

It remains to show that $1 - D(1, t') \ge (1 - c_1)^{D(1,t')/c_1}$.
Since $0 \le D(1, t') \le c_1 < 1$, it suffices to show that
$1 - x \ge (1 - b)^{x/b} = e^{(x/b)\ln(1-b)}$ for $0 \le x \le b < 1$.
Let $g(x) = 1 - x - e^{(x/b)\ln(1-b)}$. We want to show that
$g(x) \ge 0$ for $0 \le x \le b < 1$. An easy substitution
shows that $g(0) = g(b) = 0$. Differentiating g, we get
that $g'(x) = -1 - \frac{\ln(1-b)}{b} e^{(x/b)\ln(1-b)}$, and
$g''(x) = -\left(\frac{\ln(1-b)}{b}\right)^2 e^{(x/b)\ln(1-b)} < 0$.
Since $g(0) = g(b) = 0$ and g is concave, we must have
$g(x) = g((1 - x/b) \cdot 0 + (x/b) b) \ge (1 - x/b) g(0) + (x/b) g(b) = 0$
for $x \in [0, b]$, as desired.
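The concavity argument can be spot-checked numerically. The Python sketch below evaluates $g(x) = 1 - x - (1-b)^{x/b}$ on a grid and also computes the resulting constant d for one concrete case. The grid, the tolerance, and the function names `gap` and `survival_lower_bound` are choices made for this sketch; the concrete case assumes the discovery probability $1/(t+1)^2$, for which $c_1 = 1/4$ and the sum of the probabilities is $\pi^2/6 - 1$.

```python
import math


def gap(x: float, b: float) -> float:
    """g(x) = 1 - x - (1 - b)^(x / b); the proof shows g(x) >= 0 on [0, b]."""
    return (1.0 - x) - (1.0 - b) ** (x / b)


def survival_lower_bound(c1: float, total: float) -> float:
    """d = (1 - c1)^(total / c1), where total is the sum over t of D(1, t)."""
    return (1.0 - c1) ** (total / c1)


# Grid check of the inequality (small tolerance for floating-point rounding).
for i in range(1, 20):
    b = i / 20.0
    for k in range(0, 51):
        x = b * k / 50.0
        assert gap(x, b) >= -1e-12

# Concrete case: D(1, t) = 1 / (t + 1)^2, so c1 = 1/4 and the sum is pi^2/6 - 1.
d_example = survival_lower_bound(0.25, math.pi ** 2 / 6 - 1.0)
```

For this concrete case the bound d comes out slightly below 1/2, consistent with the exact non-discovery probability in Example 4.1 staying above 1/2: the general argument gives a valid but not tight constant.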

Let $r_2$ be the expected reward of the optimal policy in $M''$,
and let $r_1$ be the expected reward of the optimal policy in
the MDP $(S, A'' - \{a_1\}, P''|_{A'' - \{a_1\}}, R''|_{A'' - \{a_1\}})$. As we
have observed, $r_2 > r_1$. With probability at least d, no
algorithm will discover $a_1$, so the DM will know at most
the actions in $A'' - \{a_1\}$, and cannot get a reward higher
than $r_1$. Thus, no algorithm can give the DM an expected
reward higher than $d r_1 + (1 - d) r_2 = r_2 - d(r_2 - r_1)$. Thus, we can take
$c = d(r_2 - r_1)$.

If $|S_0| = 1$, essentially the same argument holds. We again
construct an MDP $M'' = (S_0, A'', g'', P'', R'')$. Since
$|S_0| = 1$, all components of $M''$ are determined except for
$R''$. We take $R''(s_1, s_1, a_1) = R_{\max}$, and $R''(s_1, s_1, a) =
R_{\max} - 1$ for $a \in A'' - (A_0 \cup \{a_1\})$. Again, the unique
optimal policy is to play $a_1$ at all times, so the problem reduces
to learning $a_1$. Without further assumptions, all we
can say is that the probability of learning $a_1$ after t steps of
exploration is at most $D(j, t)$, so we must replace $D(1, t)$
by $D(j, t)$ in the argument above (see footnote 4). ∎

4 We remark that we can still use $D(1, t)$ if the DM does know
that $S = S_0$, but does not know $|S|$, and $|S| \ge 2$. We can also use
$D(1, t)$ if the probability of learning the specific action $a_1$ after t
steps of exploration is $D(1, t)$.



Note that Example 4.1 is a special case of Theorem 4.3,
since $\sum_{t=1}^{\infty} \frac{1}{(t+1)^2} < \int_{t=1}^{\infty} \frac{1}{t^2}\,dt = 1$.

In the next section, we show that if $\sum_{t=1}^{\infty} D(1,t) = \infty$,
then there is an algorithm that learns near-optimal play
(although the algorithm may not be efficient). Thus,
$\sum_{t=1}^{\infty} D(1,t)$ determines whether or not there is an
algorithm that learns near-optimal play. We can say even
more. If $\sum_{t=1}^{\infty} D(1,t) = \infty$, then the efficiency of the
best algorithm for determining near-optimal play depends
on how quickly $\sum_{t=1}^{T} D(1,t)$ diverges. Specifically, the
following theorem shows that if $\sum_{t=1}^{T} D(1,t) \le f(T)$,
where $f : [1,\infty] \to \mathbb{R}$ is an increasing function whose
co-domain includes $(0,\infty]$ (so that $f^{-1}(t)$ is well defined
for $t \in (0,\infty]$) and $D(1,t) \le c < 1$ for all $t$, then the
DM cannot learn to play near-optimally with probability
$\ge 1 - \delta$ in time less than $f^{-1}(c\ln(\delta)/\ln(1-c))$. It
follows, for example, that if $f(T) = m_1 \ln(T) + m_2$,
then it requires time polynomial in $1/\delta$ to learn to play
near-optimally with probability greater than $1 - \delta$. For if
$f(T) = m_1 \ln(T) + m_2$, then $f^{-1}(t) = e^{(t - m_2)/m_1}$, so
$f^{-1}(c\ln(\delta)/\ln(1-c)) = f^{-1}(c\ln(1/\delta)/\ln(1/(1-c)))$
has the form $a(1/\delta)^b$ for constants $a, b > 0$. A similar
argument shows that if $f(T) = m_1 \ln(\ln(T) + 1) + m_2$,
then $f^{-1}(c\ln(1/\delta)/\ln(1/(1-c)))$ has the form $a e^{(1/\delta)^b}$
for constants $a, b > 0$; that is, the running time is exponential
in $1/\delta$. We remark that the assumption that $D(1,t) \le c < 1$
for all $t$ is not needed in Theorem 4.3, since it already
follows from the assumptions that $D(1,t) < 1$ and
$\sum_{t=1}^{\infty} D(1,t) < \infty$, since the latter assumption implies
that $\lim_{t \to \infty} D(1,t) = 0$.
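
The polynomial form of the lower bound can be checked numerically. The sketch below evaluates $f^{-1}(c\ln(\delta)/\ln(1-c))$ for $f(T) = m_1 \ln(T) + m_2$ and compares it with $a(1/\delta)^b$, where $a = e^{-m_2/m_1}$ and $b = c/(m_1 \ln(1/(1-c)))$ are the constants obtained by expanding the exponent. The function names and the sample parameter values are illustrative assumptions.

```python
import math


def time_lower_bound(delta: float, c: float, m1: float, m2: float) -> float:
    """f^{-1}(c ln(delta) / ln(1 - c)) for f(T) = m1 ln(T) + m2."""
    threshold = c * math.log(delta) / math.log(1.0 - c)
    return math.exp((threshold - m2) / m1)


def power_law_form(delta: float, c: float, m1: float, m2: float) -> float:
    """The same quantity written as a * (1 / delta) ** b."""
    a = math.exp(-m2 / m1)
    b = c / (m1 * math.log(1.0 / (1.0 - c)))
    return a * (1.0 / delta) ** b


if __name__ == "__main__":
    for delta in (0.1, 0.01, 0.001):
        print(delta, time_lower_bound(delta, 0.5, 2.0, 1.0),
              power_law_form(delta, 0.5, 2.0, 1.0))
```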

Theorem 4.4: If $|S_0| \ge 2$, $|A| > |A_0|$, $R_{\max}$ is
greater than the reward of the optimal policy in the MDP
$(S_0, A_0, P_0, R_0)$, $\sum_{t=1}^{\infty} D(1,t) = \infty$, and there exist a
constant $c < 1$ such that $D(1,t) \le c$ for all $t$, and an
increasing function $f : [1,\infty] \to \mathbb{R}$ such that the co-domain
of $f$ includes $(0,\infty]$ and $\sum_{t=1}^{T} D(1,t) \le f(T)$,
then for all $\delta$ with $0 < \delta < 1$, there exists a constant
$d > 0$ such that no algorithm that runs in time less than
$f^{-1}(c\ln(\delta)/\ln(1-c))$ can obtain within $d$ of the optimal
reward for all MDPs that are compatible with what
the DM knows with probability $\ge 1 - \delta$, even if the DM is
quite knowledgeable. If $|S_0| = 1$, the same result holds if
$\sum_{t=1}^{T} D(j,t) \le f(T)$, where $j = |A| - |A_0|$.

Proof: Consider the MDP $M''$ constructed in the proof of
Theorem 4.3. As we observed, $M''$ is compatible with what
the DM knows (even if the DM knows $S = S_0$, $|A|$, and the
maximum possible reward $R_{\max}$). Note that, for all $\epsilon > 0$,
the $\epsilon$-return mixing time of $M''$ is 1.

We now prove that, for all $\delta > 0$, all algorithms require at
least time $f^{-1}(c\ln(\delta)/\ln(1-c))$ to discover $a_1$ in $M''$
with probability $\ge 1 - \delta$. By assumption, there exists
a constant $c < 1$ such that $D(1,t) \le c$ for all $t$. We
must have $c > 0$, for otherwise $D(1,t) = 0$ for all $t$ and
$\sum_{t=1}^{\infty} D(1,t) = 0 \ne \infty$, a contradiction. The same
argument as in Theorem 4.3 now shows that
$\Pr(E_{t,s_1}) \ge (1 - c)^{\sum_{t'=1}^{t} D(1,t')/c}$. Since $\sum_{t'=1}^{t} D(1,t') \le f(t)$, it
follows that $\Pr(E_{t,s_1}) \ge (1 - c)^{f(t)/c}$. Note that for the
probability of discovering $a_1$ to be at least $1 - \delta$ at state
$s_1$, we must have $\Pr(E_{t,s_1}) \le \delta$, which in turn requires
that $(1 - c)^{f(t)/c} \le \delta$. Taking logs of both sides and
rearranging terms, we must have $f(t) \ge c\ln(\delta)/\ln(1-c)$, so
$t \ge f^{-1}(c\ln(\delta)/\ln(1-c))$, since $f$ is increasing. (Note
that since $0 < \delta < 1$ and $0 < c < 1$, both $\ln(1-c)$
and $\ln(\delta)$ are negative, so $\ln(\delta)/\ln(1-c) > 0$, and
$f^{-1}(c\ln(\delta)/\ln(1-c))$ is well defined.) Thus, it requires
at least time $f^{-1}(c\ln(\delta)/\ln(1-c))$ to discover $a_1$ with
probability $\ge 1 - \delta$.

Let $r_1$ be the expected reward of the optimal policy in $M''$,
and let $r_2$ be the expected reward of the optimal policy in
the MDP $(S, A'' - \{a_1\}, P''|_{A'' - \{a_1\}}, R''|_{A'' - \{a_1\}})$. By the
construction of $M''$, $r_1 > r_2$. If $a_1$ is not discovered, the DM
will know only the actions in $A'' - \{a_1\}$, and cannot get a
reward higher than $r_2$. It follows that no algorithm can give
the DM an expected reward greater than $d = (1 - \delta)r_1 + \delta r_2$
in time less than $f^{-1}(c\ln(\delta)/\ln(1-c))$. ∎

In the next section, we prove that the lower bound of Theorem 4.4
is essentially tight: if $\sum_{t=1}^{T} D(1,t) \ge f(T)$, then
the DM can learn to play near-optimally in time polynomial
in $f^{-1}(\ln(4N/\delta))$ and all the other parameters of interest.
In particular, if $f(t) \ge m_1 \ln(t) + m_2$ for some constants
$m_1$ and $m_2$, then the DM can learn to play near-optimally
in time polynomial in the relevant parameters.

We conclude this section with an observation regarding the
importance of the DM's information. As we have seen,
Theorems 4.3 and 4.4 hold even if the DM has a great deal of
information; specifically, the DM can know that $S = S_0$,
$|A|$, and the maximum possible reward. On the other hand,
our near-optimal policy construction does not require this
information. It turns out that the assumption that $D(j,t) \ge D(1,t)$
plays a crucial role in these results. While the assumption
seems quite natural (it would be strange if the
probability of discovering a new action decreased as the
number of undiscovered actions increased), it is worth noting
that without this assumption, there are cases where the
DM can find a near-optimal policy if she knows $|A|$, but
cannot do so otherwise.

Example 4.5: This example involves a family of MDPUs
$M^* = \{M^1, M^2, M^3, \ldots, M^\infty\}$,
where $M^i = (S, A^i, S_0, A_0, P, R^i, R^+, R^-, D)$ such
that $S = S_0 = \{s_1\}$, $A^i = \{a_1, a_2, \ldots, a_i\}$, $A_0 = \emptyset$,
$P(s_1, s_1, a) = 1$ for all actions $a$, $R^i(s_1, s_1, a_j) = j$ where
$j \in \{1, 2, \ldots, i\}$, $R^+(s, a_0) = R^-(s, a_0) = 0$ for all states
$s$, and $D(j,t) = \frac{1}{t+j}$. Note that the same $S$, $S_0$, $A_0$, $P$, and
$D$ are shared across the family. The DM is in one of the
MDPUs in the $M^*$ family. The DM knows that she
is in a member of the $M^*$ family, but she does not know
which. We now show that whether the DM is given an upper
bound on the number of actions makes a big
difference to her performance.

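For every state $s$ that has at least one new action, assume there
are $j_s$ actions of which the DM is unaware (note that the DM does
not know the value of $j_s$). We have

$$\Pr(E_{t,s}) = \prod_{t'=1}^{t} \left(1 - \frac{1}{t' + j_s}\right) = \frac{j_s}{j_s+1} \times \frac{j_s+1}{j_s+2} \times \frac{j_s+2}{j_s+3} \times \cdots \times \frac{j_s+t-1}{j_s+t} = \frac{j_s}{t + j_s}.$$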
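
The telescoping identity can be verified exactly with rational arithmetic. The sketch below assumes that the discovery probability in this example is $D(j,t) = 1/(t+j)$, which is what the closed form $j_s/(t+j_s)$ requires; the function name `no_discovery_prob` is an illustrative choice.

```python
from fractions import Fraction


def no_discovery_prob(j: int, t: int) -> Fraction:
    """prod_{t'=1..t} (1 - D(j, t')) with the assumed D(j, t') = 1 / (t' + j)."""
    prob = Fraction(1)
    for tp in range(1, t + 1):
        prob *= 1 - Fraction(1, tp + j)
    return prob


if __name__ == "__main__":
    # With j = 3 unknown actions and t = 7 plays of the explore action,
    # the product collapses to j / (t + j) = 3/10.
    print(no_discovery_prob(3, 7))
```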

First, assume that the DM knows $k$, an upper bound on the
actual number of actions in the underlying MDPU. Thus
$k \ge j_s$, and

$$\Pr(E_{t,s}) \le \frac{k}{t+k}.$$

Taking any $t \ge k(1-\delta)/\delta$, we get $\Pr(E_{t,s}) \le \delta$. Thus, if the DM
knows $k$, she is guaranteed to discover all actions with
probability $\ge 1 - \delta$ for any $\delta > 0$ in time polynomial
in $k$ and $1/\delta$, no matter which game she is playing. It follows
easily that she is also guaranteed to achieve the optimal
reward with probability $\ge 1 - \delta$ for any $\delta > 0$ in
polynomial time.

Now suppose the DM does not know $k$. We shall prove
that the DM cannot obtain an optimal reward no matter
how many times $a_0$ is played. Suppose the highest reward
she could get from any currently discovered action
is $r$, and suppose she has played $a_0$ $t > 0$ times without
finding any new actions. Since the underlying MDPU
can be any member of the $M^*$ family, let it be $M^i$ where
$i = \max(9t, 10r)$. Thus, the optimal reward is $i \ge 10r$,
which is at least 10 times what the DM has currently achieved.

We now show that this actually happens with constant probability:

$$\Pr(E_{t,s}) \ge \frac{9t}{t + 9t} = \frac{9}{10}.$$

In this case, with probability $\ge \frac{9}{10}$ the DM only attains $\frac{1}{10}$ of
the optimal reward. In fact, we can set $i = \max(nt, nr + r)$
with arbitrarily large $n$, in which case with probability $\ge \frac{n}{n+1}$
the DM only attains $\frac{1}{n+1}$ of the optimal reward.

In conclusion, if the DM does not know $k$, she can attain
an arbitrarily low reward compared to the optimal reward
no matter how many times she plays $a_0$. On the other hand,
if she knows $k$, she is guaranteed to achieve a near-optimal
reward in polynomial time. ∎
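
The adversarial choice of family member in this example can be checked mechanically. The sketch below follows the example in taking the number of undiscovered actions to be $i$ and the no-discovery probability to be $i/(t+i)$; the function names are hypothetical and the ranges of $t$, $r$, and $n$ are arbitrary test values.

```python
from fractions import Fraction


def adversarial_member(t: int, r: int, n: int) -> int:
    """Index i = max(n t, n r + r) of the family member M^i chosen against the DM."""
    return max(n * t, n * r + r)


def no_discovery_prob(i: int, t: int) -> Fraction:
    """i / (t + i): chance that t plays of a_0 reveal nothing with i unknown actions."""
    return Fraction(i, t + i)


if __name__ == "__main__":
    t, r, n = 50, 7, 9
    i = adversarial_member(t, r, n)
    # The DM stays undiscovered with probability >= n/(n+1) and
    # holds at most a 1/(n+1) fraction of the optimal reward i.
    print(i, no_discovery_prob(i, t), Fraction(r, i))
```

Because $i \ge nt$ forces $i/(t+i) \ge n/(n+1)$ and $i \ge (n+1)r$ forces $r/i \le 1/(n+1)$, both guarantees hold for every $t$ and $r$.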



5 LEARNING TO PLAY NEAR-OPTIMALLY 

In this section, we show that a DM can learn to play
near-optimally in an MDPU where $\sum_{t=1}^{\infty} D(1,t) = \infty$.
Moreover, we show that when $\sum_{t=1}^{\infty} D(1,t) = \infty$, the
speed at which $D(1,t)$ decreases determines how quickly
the DM can learn to play near-optimally. Specifically, if
$\sum_{t=0}^{T} D(1,t) \ge m_1 f(\ln T) + m_2$ for all $T > 0$ for constants
$m_1 > 0$ and $m_2$, and an invertible function $f$, then
the DM can learn to play near-optimally in time polynomial
in $f^{-1}(1/\delta)$. In particular, if $f$ is the identity (so that
$\sum_{t=0}^{T} D(1,t) \ge m_1 \ln T + m_2$), then the DM can learn
in time polynomial in $1/\delta$ (and, as we shall see, in time
polynomial in all other parameters of interest). We call
the learning algorithm URMAX, since it is an extension of
RMAX to MDPUs. While the condition $\sum_{t=1}^{\infty} D(1,t) = \infty$
may seem rather special, in fact it arises in many applications
of interest. For example, when learning to fly a
helicopter [?], the space of potential actions in which
the exploration takes place, while four dimensional (resulting
from the six degrees of freedom of the helicopter), can
be discretized and taken to be finite. Thus, if we explore by
examining the potential actions uniformly at random, then
$D(1,t)$ is constant for all $t$, and hence $\sum_{t=1}^{\infty} D(1,t) = \infty$.
Indeed, in this case $\sum_{t=1}^{T} D(1,t)$ is $O(T)$, so it follows
from Corollary 5.5 below that we can learn to fly the
helicopter near-optimally in polynomial time. The same is
true in any situation where the space of potential actions in
which the exploration takes place is finite and understood.

We assume throughout this section that $\sum_{t=1}^{\infty} D(1,t) = \infty$.
We would like to use an RMAX-like algorithm to learn
to play near-optimally in our setting too, but there are two
major problems in doing so. The first is that we do not
want to assume that the DM knows $|S|$, $|A|$, or $R_{\max}$. We
deal with the fact that $|S|$ and $|A|$ are unknown by using
essentially the same idea as Kearns and Singh use for dealing
with the fact that the true $\epsilon$-mixing time $T$ is unknown:
we start with an estimate of the value of $|S|$ and $|A|$, and
keep increasing the estimate. Eventually, we get to the right
values, and we can compensate for the fact that the payoff
return may have been too low up to that point by playing
the policy sufficiently often. The idea for dealing with the
fact that $R_{\max}$ is not known is similar. We start with an
estimate of the value of $R_{\max}$, and recompute the value of
$K_1(T)$ and the approximating MDP every time we discover
a transition with a reward higher than the current estimate.
(We remark that this idea can be applied to RMAX as well.)
The second problem is more serious: we need to deal with
the fact that not all actions are known, and that we have a
special explore action. Specifically, we need to come up
with an analogue of $K_1(T)$ that describes how many times
we should play the explore action $a_0$ in a state $s$, with the
goal of discovering all the actions in $s$. Clearly this value
will depend on the discovery probability (it turns out that
all we need to know is $D(1,t)$ for all $t$) in addition to all
the parameters that $K_1(T)$ depends on.

We now describe the URMAX algorithm under the assumption
that the DM knows $N$, an upper bound on the size of the state
space $S$; $k$, an upper bound on the size of the action space
$A$; $R_{\max}$, an upper bound on the true maximum reward;
and $T$, an upper bound on the $\epsilon$-return mixing time. To
emphasize the dependence on these parameters, we denote
the algorithm URMAX($S_0, g_0, D, N, k, R_{\max}, T, \epsilon, \delta, s_0$).
(The DM may also know $R^+$ and $R^-$, but the algorithm
does not need these inputs.) We later show how to define
URMAX($S_0, g_0, D, \epsilon, \delta, s_0$), dropping the assumption that
the DM knows $N$, $k$, $T$, and $R_{\max}$.

Define 

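• $K_1(T) = \max\left(\left(\left\lceil \frac{4NTR_{\max}}{\epsilon} \right\rceil\right)^3, \left\lceil 8\ln^3\left(\frac{8Nk}{\delta}\right) \right\rceil\right) + 1$;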

• $K_0 = \min_M \{M : \sum_{t=1}^{M} D(1,t) \ge \ln(4N/\delta)\}$.
(Such a $K_0$ always exists if $\sum_{t=1}^{\infty} D(1,t) = \infty$.)
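
The definition of $K_0$ translates directly into a search over partial sums. The following is a minimal sketch, reading the threshold as $\ln(4N/\delta)$ as in the proof of Lemma 5.1; the function name `compute_K0`, the callable `D1` standing for $t \mapsto D(1,t)$, and the `max_steps` safety cap are assumptions introduced here and are not part of the paper.

```python
import math
from typing import Callable


def compute_K0(D1: Callable[[int], float], N: int, delta: float,
               max_steps: int = 10**7) -> int:
    """Smallest M with sum_{t=1..M} D(1, t) >= ln(4N / delta)."""
    target = math.log(4 * N / delta)
    total = 0.0
    for M in range(1, max_steps + 1):
        total += D1(M)
        if total >= target:
            return M
    # Reached only if the partial sums grow too slowly (or converge).
    raise ValueError("partial sums did not reach ln(4N/delta) within max_steps")


if __name__ == "__main__":
    # Constant discovery probability, as in uniform random exploration
    # over a finite action space.
    print(compute_K0(lambda t: 0.5, N=10, delta=0.1))
```

The cap matters in practice: when $\sum_t D(1,t)$ converges, no such $M$ need exist, and the search would otherwise never stop.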

Just as with RMAX, $K_1(T)$ is a bound on how long the DM
needs to get a good estimate of the transition probabilities
at each state $s$. Our definition of $K_1(T)$ differs slightly
from that of Brafman and Tennenholtz (we have a coefficient
8 rather than 6; the difference turns out to be needed
to allow for the fact that we do not know all the actions).
As we show below (Lemma 5.1), $K_0$ is a good estimate of
how often the explore action needs to be played in order to
ensure that, with high probability (greater than $1 - \delta/4N$),
at least one new action is discovered at a state, if there is a
new action to be discovered. Just as with RMAX, we take a
pair $(s, a)$ for $a \ne a_0$ to be known if it is played $K_1$ times;
we take a pair $(s, a_0)$ to be known if it is played $K_0$ times.

URMAX($S_0, g_0, D, N, k, R_{\max}, T, \epsilon, \delta, s_0$) proceeds just
like RMAX($N, k, R_{\max}, T, \epsilon, \delta, s_0$), except for the following
modifications:

• The algorithm terminates if it discovers a reward
greater than $R_{\max}$, more than $k$ actions, or more than
$N$ states ($N$, $k$, and $R_{\max}$ can be viewed as the current
guesses for these values; if the guess is discovered
to be incorrect, the algorithm is restarted with better
guesses).

• If $(s, a_0)$ has just become known, then we set the reward
for playing $a_0$ in state $s$ to be $-\infty$ (this ensures
that $a_0$ is not played any more in state $s$).

For future reference, we say that an inconsistency is discovered
if the algorithm terminates because it discovers a
reward greater than $R_{\max}$, more than $k$ actions, or more
than $N$ states.
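
The two modifications and the notion of an inconsistency can be sketched as bookkeeping around each observed transition. Everything in this block is a hypothetical illustration: the class and method names, the string label for the explore action, and the choice not to count $a_0$ toward the $k$ discovered actions are assumptions made here, not details given in the paper.

```python
import math
from collections import defaultdict

EXPLORE = "a0"  # hypothetical label for the explore action a_0


class Inconsistency(Exception):
    """An observation contradicts the current guesses N, k, or R_max."""


class Bookkeeping:
    def __init__(self, N: int, k: int, R_max: float, K0: int, K1: int):
        self.N, self.k, self.R_max, self.K0, self.K1 = N, k, R_max, K0, K1
        self.plays = defaultdict(int)   # (state, action) -> times played
        self.states, self.actions = set(), set()
        self.reward_override = {}       # (state, action) -> forced model reward

    def observe(self, s, a, reward: float, s_next) -> None:
        """Record one transition; raise Inconsistency if a guess is violated."""
        self.states.update((s, s_next))
        if a != EXPLORE:                # assumption: a_0 is not counted toward k
            self.actions.add(a)
        if (reward > self.R_max or len(self.actions) > self.k
                or len(self.states) > self.N):
            raise Inconsistency((s, a, reward, s_next))
        self.plays[(s, a)] += 1
        if a == EXPLORE and self.plays[(s, a)] == self.K0:
            # (s, a_0) has just become known: stop exploring at s.
            self.reward_override[(s, a)] = -math.inf

    def is_known(self, s, a) -> bool:
        """(s, a_0) is known after K0 plays; any other (s, a) after K1 plays."""
        threshold = self.K0 if a == EXPLORE else self.K1
        return self.plays[(s, a)] >= threshold
```

An outer driver would catch `Inconsistency` and restart with larger guesses, which is the role of the parameter-free version of the algorithm.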

The next lemma shows that $K_0$ has the required property.

Lemma 5.1: Let $K_0$ be defined as above. If the DM plays
$a_0$ $K_0$ times at state $s$, then with probability $\ge 1 - \delta/4N$
a new action will be discovered if there is at least one new
action at state $s$ to be discovered.

Proof: Suppose that $s$ is a state where the DM is unaware
of $j \ge 1$ actions. Then

$$\Pr(E_{K_0,s}) = \prod_{t'=1}^{K_0} (1 - D(j,t')) \le \prod_{t'=1}^{K_0} (1 - D(1,t')).$$

We show below that $1 - D(1,t') \le e^{-D(1,t')}$. Thus,

$$\Pr(E_{K_0,s}) \le \prod_{t'=1}^{K_0} e^{-D(1,t')} = e^{-\sum_{t'=1}^{K_0} D(1,t')}.$$

The choice of $K_0$ guarantees that $\sum_{t=1}^{K_0} D(1,t) \ge \ln(4N/\delta)$. Thus,

$$\Pr(E_{K_0,s}) \le e^{-\ln(4N/\delta)} = \frac{\delta}{4N}.$$

It remains to show that $1 - D(1,t') \le e^{-D(1,t')}$. Since
$D(1,t') \ge 0$, it suffices to show that $1 - x \le e^{-x}$ for $x \ge 0$.
Let $g(x) = 1 - x - e^{-x}$. We want to show that $g(x) \le 0$
for $x \ge 0$. An easy substitution shows that $g(0) = 0$.
Differentiating $g$, we get that $g'(x) = -1 + e^{-x} \le 0$ when
$x \ge 0$. Since $g(0) = 0$ and $g$ is nonincreasing when $x \ge 0$,
we must have $g(x) \le 0$ for $x \ge 0$, as desired. ∎
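
The chain of inequalities in this proof can be checked numerically for a specific discovery function. The sketch below returns both the exact no-discovery probability $\prod_t (1 - D(1,t))$ and the bound $e^{-\sum_t D(1,t)}$; the function name and the sample choice $D(1,t) = 1/(t+1)$ are assumptions made for illustration.

```python
import math
from typing import Callable, Tuple


def failure_prob_and_bound(D1: Callable[[int], float],
                           plays: int) -> Tuple[float, float]:
    """Return (prod_{t=1..plays} (1 - D(1,t)), exp(-sum_{t=1..plays} D(1,t)))."""
    product, total = 1.0, 0.0
    for t in range(1, plays + 1):
        product *= 1.0 - D1(t)
        total += D1(t)
    return product, math.exp(-total)


if __name__ == "__main__":
    for plays in (1, 5, 11, 50):
        exact, bound = failure_prob_and_bound(lambda t: 1.0 / (t + 1), plays)
        print(plays, exact, bound)  # exact never exceeds bound
```

With $N = 1$ and $\delta = 1/2$, the bound drops below $\delta/4N = 0.125$ once the partial sum reaches $\ln 8$, which for this $D$ happens after 11 plays.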

We first show that URMAX($S_0, g_0, D, N, k, R_{\max}, T, \epsilon, \delta, s_0$)
is correct provided that the parameters are correct.

Theorem 5.2: For all MDPs $M = (S, A, g, P, R)$ compatible
with $S_0$, $g_0$, $N$, $k$, $R_{\max}$, and $T$ (i.e., $S \supseteq S_0$,
$g(s) \supseteq g_0(s)$ for all $s \in S_0$, $|S| \le N$, $|A| \le k$,
$R(s, s', a) \le R_{\max}$ for all $s, s' \in S$ and $a \in A$, and the
$\epsilon$-return mixing time of $M$ is $\le T$), and all states $s_0 \in S_0$,
with probability at least $1 - \delta$,
URMAX($S_0, g_0, D, N, k, R_{\max}, T, \epsilon, \delta, s_0$)
running on $M$ returns a policy whose expected return is
at least $\mathrm{Opt}(M, \epsilon, T) - 2\epsilon$. Moreover, it does so in time
polynomial in $N$, $k$, $T$, $1/\epsilon$, $1/\delta$, $R_{\max}$, and $K_0$.

Proof: The basic structure of the proof follows lines similar 
to the correctness proof of RMAX [3]. (Related results are
proved in [?].) We sketch the details here.

The running time is clear from the description of the algorithm.
Let $M_r = (S_r, A_r, g_r, P_r, R_r)$ be the MDP that is
finally computed by URMAX in execution $r$ of the policy.
Thus, $S_r$ is the set of states discovered in execution $r$, $A_r$
is the set of actions discovered, and so on. Although $M_r$
may not be identical to $M$, we will show that the set of
executions $r$ where $\mathrm{Opt}(M_r, \epsilon, T) \ge \mathrm{Opt}(M, \epsilon, T) - 2\epsilon$
has probability at least $1 - \delta$, where the probability of an
execution is determined by the transition probabilities $P$ of
the actual MDP $M$. The key points of the argument are the
following.



(a) With probability at least $1 - \delta/4$, every action that can
be played in a state $s \in S_r$ is discovered (i.e., is in
$g_r(s)$).

(b) If all actions that can be played in $s \in S_r$ are discovered,
then with probability at least $1 - \delta/4$, $P_r$ is
close to $P$; specifically, $|P_r(s, s', a) - P(s, s', a)| \le \frac{\epsilon}{4NTR_{\max}}$
for all $s, s' \in S_r$, $a \in g_r(s)$.

(c) $S_r$ contains all the "significant" states in $S$; if (a)
and (b) hold, with probability at least $1 - \delta/4$,
$\mathrm{Opt}(M_r, \epsilon, T) \ge \mathrm{Opt}(M, \epsilon, T) - 2\epsilon$.

Part (a) is immediate from Lemma 5.1 and the fact that
$|S_r| \le N$. Parts (b) and (c) are similar to the arguments
given by Brafman and Tennenholtz, so we defer the details
to the full paper. The desired result now follows using
techniques similar in spirit to those used by Brafman and
Tennenholtz [3]; we leave the details to the full paper. ∎

We get URMAX($S_0, g_0, D, \epsilon, \delta, s_0$) by running
URMAX($S_0, g_0, D, N, k, R_{\max}, T, \epsilon, \delta, s_0$) using larger
and larger values for $N$, $k$, $R_{\max}$, and $T$. Sooner or later
the right values are reached. Once that happens, with
high probability, the policy produced will be optimal in
all later iterations. However, since we do not know when
that happens, we need to continue running the algorithm.
We must thus play the optimal policy computed at each
iteration enough times to ensure that, if we have estimated
$N$, $k$, $R_{\max}$, and $T$ correctly, the average reward stays
within $2\epsilon$ of optimal while we are testing higher values
of these parameters. For example, suppose that the actual
values of these parameters are all 100. Thus, with high
probability, the policy computed with these values will
give an expected payoff that is within $2\epsilon$ of optimal.
Nevertheless, the algorithm will set these parameters to 101 and
recompute the optimal policy. While this recomputation is
going on, it may get low reward (although, eventually it
will get close to optimal reward). We need to ensure that
this period of low rewards does not affect the average.

URMAX($S_0, g_0, D, \epsilon, \delta, s_0$):

  Set $N := |S_0|$, $k := |A_0|$, $R_{\max} := 1$, $T := 1$
  Repeat forever
    Run URMAX($S_0, g_0, D, N, k, R_{\max}, T, \epsilon, \delta, s_0$)
    if no inconsistency is discovered
      then run the policy computed by
        URMAX($S_0, g_0, D, N, k, R_{\max}, T, \epsilon, \delta, s_0$) for
        $K_2 + K_3$ steps,

where K 2 = 2(iVfcmax(X 1 (T + l),K ))§R ma , x /e 
K 3 = (2i? max + 1) max ((2iW) 3 , 8 ln(|) 3 )/e 
    $N := N + 1$; $k := k + 1$; $R_{\max} := R_{\max} + 1$; $T := T + 1$
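
The control flow of this outer loop can be sketched as a generator. The inner algorithm and the exploitation phase are passed in as callables, since their details (including the step counts $K_2$ and $K_3$) are not reproduced here; the names `urmax_outer_loop`, `run_inner`, and `exploit`, and the convention that `run_inner` returns `None` on an inconsistency, are assumptions made for illustration.

```python
from typing import Callable, Iterator, Optional, Tuple

Guess = Tuple[int, int, float, int]  # (N, k, R_max, T)


def urmax_outer_loop(
    n_states0: int,
    n_actions0: int,
    run_inner: Callable[[int, int, float, int], Optional[object]],
    exploit: Callable[[object, int, int, float, int], None],
) -> Iterator[Guess]:
    """Yield each guess (N, k, R_max, T) once it has been processed."""
    N, k, R_max, T = n_states0, n_actions0, 1.0, 1
    while True:  # "Repeat forever"; the caller decides when to stop iterating
        policy = run_inner(N, k, R_max, T)    # None signals an inconsistency
        if policy is not None:
            # The caller is responsible for playing the policy K_2 + K_3 steps.
            exploit(policy, N, k, R_max, T)
        yield (N, k, R_max, T)
        N, k, R_max, T = N + 1, k + 1, R_max + 1, T + 1
```

Writing the loop as a generator keeps "repeat forever" intact while letting a caller or a test consume only finitely many rounds.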

The following theorem shows that
URMAX($S_0, g_0, D, \epsilon, \delta, s_0$) is correct. (The proof,
which is deferred to the full paper, explains the choice of
$K_2$ and $K_3$.)

Theorem 5.3: For all MDPs $M = (S, A, g, P, R)$ compatible
with $S_0$ and $g_0$, if the $\epsilon$-return mixing time of $M$ is
$T_M$, then for all states $s_0 \in S_0$, with probability at least
$1 - \delta$, URMAX($S_0, g_0, D, \epsilon, \delta, s_0$)
computes a policy $\pi_{\epsilon,\delta,T_M,s_0}$ such that, for a time $t_{M,\epsilon,\delta}$
that is polynomial in $|S|$, $|A|$, $T_M$, $1/\epsilon$, and $K_0$, and all
$t \ge t_{M,\epsilon,\delta}$, we have $U_M(s_0, \pi_{\epsilon,\delta,T_M,s_0}, t) \ge \mathrm{Opt}(M, \epsilon, T_M) - 2\epsilon$.

Thus, if $\sum_{t=1}^{\infty} D(1,t) = \infty$, the DM can learn to play
near-optimally. We now get running time estimates that
essentially match the lower bounds of Theorem 4.4.

Proposition 5.4: If $\sum_{t=1}^{T} D(1,t) \ge f(T)$, where
$f : [1,\infty] \to \mathbb{R}$ is an increasing function whose co-domain
includes $(0,\infty]$, then $K_0 \le f^{-1}(\ln(4N/\delta))$, and the running
time of URMAX is polynomial in $f^{-1}(\ln(4N/\delta))$.

Proof: Immediate from Theorem 5.3 and the definition of
$K_0$. ∎

Recall from Theorem 4.4 that if $\sum_{t=1}^{T} D(1,t) \le f(T)$, then
no algorithm that learns near-optimally can run in time less
than $f^{-1}(c'\ln(1/\delta))$ (where $c' = c/\ln(1/(1-c))$), so we
have proved an upper bound that essentially matches the
lower bound of Theorem 4.4.

Corollary 5.5: If $\sum_{t=1}^{T} D(1,t) \ge m_1 \ln(T) + m_2$ (resp.,
$\sum_{t=1}^{T} D(1,t) \ge m_1 \ln(\ln(T) + 1) + m_2$) for some constants
$m_1 > 0$ and $m_2$, then the DM can learn to play
near-optimally in polynomial time (resp., exponential time).

Proof: If $f(T) = m_1 \ln(T) + m_2$, then as we have
observed, $f^{-1}(t) = e^{(t - m_2)/m_1}$, so
$f^{-1}(\ln(4N/\delta)) = e^{-m_2/m_1} (4N)^{1/m_1} (1/\delta)^{1/m_1}$.
Thus, $f^{-1}(\ln(4N/\delta))$ has the form $a(1/\delta)^{1/m_1}$, and is
polynomial in $1/\delta$. The result now follows from Theorem 5.2.
The argument is similar if
$\sum_{t=1}^{T} D(1,t) \ge m_1 \ln(\ln(T) + 1) + m_2$; we leave details
to the reader. ∎
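
The closed form in this proof, and the bound $K_0 \le f^{-1}(\ln(4N/\delta))$ of Proposition 5.4, can be checked on a concrete case. The sketch below uses the assumed discovery function $D(1,t) = 1/(t+1)$, for which $\sum_{t=1}^{T} D(1,t) \ge \ln((T+2)/2) \ge \ln(T) - \ln 2$, so one may take $m_1 = 1$ and $m_2 = -\ln 2$; all function names are illustrative.

```python
import math


def f_inverse(u: float, m1: float, m2: float) -> float:
    """Inverse of f(T) = m1 ln(T) + m2."""
    return math.exp((u - m2) / m1)


def k0_upper_bound(N: int, delta: float, m1: float, m2: float) -> float:
    """f^{-1}(ln(4N / delta)), the bound on K_0 from Proposition 5.4."""
    return f_inverse(math.log(4 * N / delta), m1, m2)


def closed_form(N: int, delta: float, m1: float, m2: float) -> float:
    """e^{-m2/m1} (4N)^{1/m1} (1/delta)^{1/m1}."""
    return math.exp(-m2 / m1) * (4 * N) ** (1 / m1) * (1 / delta) ** (1 / m1)


def k0_harmonic(N: int, delta: float) -> int:
    """K_0 for the assumed D(1, t) = 1 / (t + 1), by direct summation."""
    target, total, M = math.log(4 * N / delta), 0.0, 0
    while total < target:
        M += 1
        total += 1.0 / (M + 1)
    return M


if __name__ == "__main__":
    N, delta = 5, 0.1
    print(k0_harmonic(N, delta), k0_upper_bound(N, delta, 1.0, -math.log(2)))
```

For these constants the bound is $8N/\delta$, which is linear in $1/\delta$, in line with the polynomial-time case of the corollary.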

6 CONCLUSION 

We have defined an extension of MDPs that we call MD- 
PUs, Markov Decision Processes with Unawareness, to 
deal with the possibility that a DM may not be aware of 
all possible actions. We provided a complete characteriza- 
tion of when a DM can learn to play near-optimally in an 
MDPU, and have provided an algorithm that learns to play 
near-optimally when it is possible to do so, as efficiently 
as possible. Our methods and results thus provide guiding 
principles for designing complex systems. 

We believe that MDPUs should be widely applicable. We
hope to apply the insights we have gained from this theoretical
analysis to using MDPUs in practice, for example,
to enable a robotic car to learn new driving skills. Our results
show that there will be situations when an agent cannot
hope to learn to play near-optimally. In that case, an
obvious question to ask is what the agent should do. Work
on budgeted learning has been done in the MDP setting
[?]; we would like to extend this to MDPUs.
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